{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bayesian Statistics Made Simple\n",
    "\n",
    "Code and exercises from my workshop on Bayesian statistics in Python.\n",
    "\n",
    "Copyright 2019 Allen Downey\n",
    "\n",
    "MIT License: https://opensource.org/licenses/MIT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If we're running on Colab, install empiricaldist\n",
    "# https://pypi.org/project/empiricaldist/\n",
    "\n",
    "import sys\n",
    "IN_COLAB = 'google.colab' in sys.modules\n",
    "\n",
    "if IN_COLAB:\n",
    "    !pip install empiricaldist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import seaborn as sns\n",
    "sns.set_style('white')\n",
    "sns.set_context('talk')\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from empiricaldist import Pmf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The World Cup problem\n",
    "\n",
    "In the 2018 FIFA World Cup final, France defeated Croatia 4 goals to 2.  Based on this outcome, we can answer the following questions:\n",
    "\n",
    "1. How confident should we be that France is the better team?\n",
    "\n",
    "2. If the same teams played again, what is the chance Croatia would win?\n",
    "\n",
    "To answer these questions, we have to make some modeling assumptions:\n",
    "\n",
    "1. Goal scoring can be well modeled by a Poisson process, so the distribution of goals scored by each team against the other is Poisson($\\lambda$), where $\\lambda$ is a goal-scoring rate, measured in goals per game.\n",
    "\n",
    "2. For two random World Cup teams, the distribution of goal scoring rates is Gamma($\\alpha$), where $\\alpha$ is a parameter we can choose based on past results.\n",
    "\n",
    "To determine $\\alpha$, I used [data from previous World Cups](https://www.statista.com/statistics/269031/goals-scored-per-game-at-the-fifa-world-cup-since-1930/) to estimate that the average goal scoring rate is about 1.4 goals per game.\n",
    "\n",
    "We can use `scipy.stats.gamma` to compute the PDF of the Gamma distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import gamma\n",
    "\n",
    "α = 1.4\n",
    "qs = np.linspace(0, 6)\n",
    "ps = gamma(α).pdf(qs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use `qs` and `ps` to make a `Pmf` that represents the prior distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "prior = Pmf(ps, index=qs)\n",
    "prior.normalize()\n",
    "prior.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And plot it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def decorate_rate(title):\n",
    "    \"\"\"Labels the axes.\n",
    "    \n",
    "    title: string\n",
    "    \"\"\"\n",
    "    plt.xlabel('Goal scoring rate')\n",
    "    plt.ylabel('PMF')\n",
    "    plt.title(title)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "prior.plot()\n",
    "decorate_rate('Prior distribution')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This prior implies:\n",
    "\n",
    "1. The most common goal-scoring rates are near 1.\n",
    "\n",
    "2. The goal-scoring rate is never 0; eventually, any team will score against any other.\n",
    "\n",
    "3. The goal-scoring rate is unlikely to be greater than 4, and never greater than 6."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The likelihood function\n",
    "\n",
    "Suppose you are given the goal-scoring rate, $\\lambda$, and asked to compute the probability of scoring a number of goals, $k$.  The answer is given by the Poisson PMF:\n",
    "\n",
    "$ \\mathrm{PMF}(k; \\lambda) = \\frac{\\lambda^k \\exp(-\\lambda)}{k!} $\n",
    "\n",
    "**Exercise 1:** Write a likelihood function that takes $k$ and $\\lambda$ as parameters `data` and `hypo`, and computes $\\mathrm{PMF}(k; \\lambda)$.\n",
    "\n",
    "You can use NumPy/SciPy functions or `scipy.stats.poisson`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def likelihood(data, hypo):\n",
    "    \"\"\"Likelihood function for World Cup\n",
    "    \n",
    "    data: integer number of goals in a game\n",
    "    hypo: goal scoring rate in goals per game\n",
    "    \n",
    "    returns: float probability\n",
    "    \"\"\"\n",
    "    # TODO: fill this in!\n",
    "    return 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Update\n",
    "\n",
    "First we'll compute the posterior distribution for France, having seen them score 4 goals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "france = Pmf(prior, copy=True)\n",
    "\n",
    "france.update(likelihood, 4)\n",
    "france.plot(label='France')\n",
    "decorate_rate('Posterior distribution, 4 goals')\n",
    "\n",
    "france.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 2:** Do the same for Croatia."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Solution\n",
    "\n",
    "croatia = Pmf(prior, copy=True)\n",
    "\n",
    "croatia.update(likelihood, 2)\n",
    "croatia.plot(label='Croatia', color='C3')\n",
    "decorate_rate('Posterior distribution, 2 goals')\n",
    "\n",
    "croatia.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Probability of superiority\n",
    "\n",
    "Now that we have a posterior distribution for each team, we can answer the first question: How confident should we be that France is the better team?\n",
    "\n",
    "In the model, \"better\" means having a higher goal-scoring rate against the opponent.  We can use the posterior distributions to compute the \"probability of superiority\", which is the probability that a random value drawn from France's distribution exceeds a value drawn from Croatia's.\n",
    "\n",
    "Remember that `Pmf` provides `choice`, which returns a random sample as a NumPy array:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_france = france.choice(size=1000)\n",
    "sample_france.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 3:** Generate a similar sample for Croatia; then compute the fraction of samples where the goal-scoring rate is higher for Croatia.  \n",
    "\n",
    "Hint: use `np.mean`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the basis of one game, we have only moderate confidence that France is actually the better team."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predicting the rematch\n",
    "\n",
    "Now we can take on the second question: If the same teams played again, what is the chance Croatia would win?\n",
    "\n",
    "To answer this question, we'll generate a sample from the \"posterior predictive distribution\", which is the number of goals we expect a team to score.\n",
    "\n",
    "If we knew the goal scoring rate, $\\lambda$, the distribution of goals would be $Poisson(\\lambda)$.\n",
    "\n",
    "Since we don't know $\\lambda$, we can use the sample we generated in the previous section to generate a sample of goals, like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "goals_france = np.random.poisson(sample_france)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can plot the results:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def decorate_goals(title):\n",
    "    \"\"\"Labels the axes.\n",
    "    \n",
    "    title: string\n",
    "    \"\"\"\n",
    "    plt.xlabel('Goals scored')\n",
    "    plt.ylabel('PMF')\n",
    "    plt.ylim([0, 0.32])\n",
    "    plt.title(title)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "pmf_france = Pmf.from_seq(goals_france)\n",
    "pmf_france.bar(label='France')\n",
    "decorate_goals('Predictive distribution')\n",
    "plt.legend()\n",
    "\n",
    "goals_france.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This distribution represents two sources of uncertainty: we don't know the actual value of $\\lambda$, and even if we did, we would not know the number of goals in the next game."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 4:** Generate and plot the predictive distribution for Croatia."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a sense, these distributions represent the outcomes of 1000 simulated games."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 5:** Compute the fraction of simulated rematches Croatia would win, how many France would win, and how many would end in a tie."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Solution goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assuming that Croatia wins half of the ties, their chance of winning the rematch is about 33%."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
